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Abstract 



We describe two algorithms for computing a sparse solution to a least-squares problem where 
the coefficient matrix can have arbitrary dimensions. We show that the solution vector obtained 
by our algorithms is close to the solution vector obtained via the truncated SVD approach. 



^: 1 Introduction 



Fix inputs A £ flj mxn an d b S M m . We study the following minimization least-squares problem, 



£C! ' m L n l|Ax — t» 1 1 2 - 



xGlf 



o 

Since there is no assumption on m and n, the above problem might have more than one solutions. 
It is well known though that the solution vector which minimizes both ||Ax — t> 1 1 2 and ||x||2 can be 
^ . found using the pseudo-inverse of A, 



x* = A+b = (A T A) 1 A T b. 

When A is ill-conditioned, A+ becomes unstable to perturbations and overfitting can become 
' a serious problem. Practitioners deal with such situations by regularizing the above least-squares 



problem. Popular regularization techniques include the Lasso [10], the Tikhonov regularization [6], 
as well as the truncated SVD regularization [8]. In this work, we focus on the later approach 
and develop a new regularization tool which returns a sparse solution vector that has comparable 
performance to the dense solution vector which is obtained via the truncated SVD. In some more 
details, for k < rank(A), let A& 6 W nxn of rank k denotes the rank-A: SVD of A; then, the truncated 
SVD regularized solution x£ is given by 

x* fc = A+b. 

In the present article, we describe a deterministic and a randomized algorithm that compute a 
solution vector x r S M n with r non-zero entries such that r ~ k and 

||Ax£ — t> 1 1 2 ~ ||Ax r — t> 1 1 2 - 

Our main motivation is interpretability: a sparse vector x r implies that b can be (approximately) 
expressed as a linear combination of a small set of columns from A. On the other hand, any dense 
solution vector x, such as x£, expresses b as a linear combination of (up to) all the columns of A. 



'Similar results to the main results of this article appeared previously in [1]. 
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1.1 Preliminaries 



The Singular Value Decomposition (SVD) of a matrix A G ]g> mxra Q f rank p is 




Here, U fc G R mxk and U p _ fc G R mx (/>- fc ) contain the left singular vectors of A. Similarly, V fc G R nxk 
and Vp^fc G R nx (^~ fc ) contain the right singular vectors. The singular values of A, which we denote 
as (7i (A) > 05(A) > > <7 P (A) > are contained in T, k G R kxk and S p „ fc G R(p- fc ) x (p- fc ). One can 
compute the SVD of A in 0(mnmm{m,n}) time. We use A + = VaS^Ua ^ R nxm to denote the 
Moore-Penrose pseudo-inverse of A with S^ 1 denoting the inverse of £a- Let A^ = UfcX^V^ G R mxn 
and A p _ fc = A - A fc = V p ^ P -kVj-k € R mxn . For fc < rank(A), the SVD gives the best 
rank k approximation to A in both the spectral and the Frobenius norm: for A G R' mxn , let 
rank(A) < k; then, for £ = 2,F, ||A - A fc || 5 < ||A - A|| € . Also, ||A - A fc || 2 = ||S p _ fc || 2 = o fe+ i(A), 
and ||A — A^Hp = ||Sp_fc||p = X^f=fc+i The Frobenius and the spectral norm of A are 

defined as: ||A||| = Ylij^Hj = Yli=i °f(A); and ||A||2 = 01(A). Let X and Y are matrices of 
appropriate dimensions; then, ||XY||f < min{||X||F||Y||2, ||X||2||Y||f}- This is a stronger version of 
the standard submultiplicavity property ||XY|| F < ||X||^||Y||p, which we will refer to as spectral 
submultiplicativity. Recall the least-squares problem min x ||Ax — t» 1 1 2 - Given k < p = rank(A), 
the k truncated SVD regularized weights are x* = A+b = VfcE^Ujb G R n . Note also that 
||b - UfeU^b|| 2 = ||b - A fe A^b|| 2 . Finally, for r < n, let C G R' mxr contains r columns of A. 
We can equivalently write C = Ail, where the sampling matrix is = [e^, . . . , e^] G M. nxr and 
ej G M. m are appropriate vectors from the standard basis. Let S G R rxr be a diagonal rescaling 
matrix with positive entries; then, C = AQS contains a subset of r columns from A rescaled with 
the corresponding diagonal elements in S. 



1.2 Main Results 

Theorem 1. Let A G W nxn , h G R m , < k < rank(A), and < e < 1/2. Algorithm 1 runs in 
time 0(mnm.m{m,n} + nk 3 /e 2 ) and returns x r G W 1 with r = |"9A;/e 2 ] non-zero entries such that, 

||Ax r - b|| 2 < ||Ax* fc - b|| 2 + (1 + e)||b|| 2 l|A ~^ llj? . 

Theorem 2. Let A G W mxn , b G R m , < k < rank(A), and < e < 1/2. Algorithm 3 runs in time 
O (mn min{m, n} + k log k log(£;/e)/e 2 ) and returns x r G M n with r = |" 36k ln(20A;)/e 2 1 non-zero 
entries such that with probability at least 0.7, 

\\A± r - b|| 2 < ||AxS - b|| 2 + e||b|| J* A ~ f" k } F . 

Ofc(A) 

Lemma 3. Fix A G M mxn , b G M. n , rank parameter k < rank(A), and sparsity parameter r > k. 
Let x£ = A^b G R n , where A k G R mxn is the rank k matrix from the SVD of A. Let Q G R nxr and 
S G R rxr be any sampling and rescaling matrices with rank(V^riS) = k. Let x r G R n be a vector 
with r non-zero entries which is obtained as follows: let C = AOS G R mxr and x r = C + b G R r ' ; 
construct x r G R n from x r at the indices corresponding to the selected columns of C and elsewhere. 
Then, 

||Ax r - b|| 2 < ||Ax* fc - b|| 2 + ||(A - A fc )OS(V^S)+S- 1 U^b|| 2 . 

This lemma is the main technical contribution of our work. Combining this lemma with two 
existing algorithms that satisfy its requirements gives our main theorems. More specifically, to 
design our deterministic algorithm (Algorithm 1) we used a method from [2] (Algorithm 2); to 
design our randomized algorithm (Algorithm 3) we used a method from [9] (Algorithm 4). 
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1: Input: A G M mxn , target rank k < rank(A), and accuracy parameter < e < 1/2. 
2: Compute V fc G M nxfc and E = A - A k = A - AV fc Vj G M mxn from the SVD of A. 
3: Let r = \9k/e 2 ], [0,S] = DeterministicSampling(V^, E, r), and C = AQS G M mxr . 
4: Let x r = C + b G ffi r ; and construct x r G R n from x r at the indices corresponding to 

the selected columns C and elsewhere. 
5: Return x r G M. n with at most r non-zero entries. 



Algorithm 1: A Deterministic Sparse Solver for Least-Squares 



1 


Input: V T = [vi, . . . ,v n ] G R kxn with G R k , E = [ei, 
e, G R m , and r > k. 


..,e n ] eK rax " with 


2 


Initialize Bo = Ofc X fc, Q = nxr , and S = rxr . 




3 


for r = to r — 1 do 




4 


Set L r = r — \frk. 




5 


Pick index i G {1, 2, n} and t such that U(ei) < \ < 


L(yi, B t _i,l t ). 


6 


Update B T = B T _i + tvivj. Set Qi tT +i = 1 and S r+ i )T 


f i = i/Vt. 


7 


end for 




8 


Return: Sampling and rescaling matrices 0, G R nxr , S G 


l rxr . 



Algorithm 2: DeterministicSampling [2] 



2 Algorithms 

Our deterministic algorithm (Theorem 1). Algorithm 1 deterministically selects r columns 
of A to form C and the corresponding sparse vector x r . The meat of this method is the subroutine 
DeterministicSampling, which is an algorithm to simultaneously sample the columns of two matrices, 
while controlling their spectral and Frobenius norms. DeterministicSampling takes input two matrices 
V T G R fcxn and E G M. mxn . We assume that V is orthonormal, so V T V = L> To describe 
the algorithm, it is convenient to view these two matrices as two sets of n column vectors, V T = 
[vi, . . . , v„] and E = [ei, . . . , e re ]. In our application, V T = and E = A — A&. 

Given k and r, introduce the iterator r = 0, 1, 2, r — 1, and define the parameter L T = r — \frk~. 
For a square symmetric matrix B G M. kxk with eigenvalues Ai, . . . , Afc and L G M, define functions 
'^B) = £- =1 j^, and L(v,B,L) = ^b^Vb] ~ ~ ^h)-^, where l' = L + 1. Also, 



for a vector e, define function U(e) = * % — yk/rj . At every step r, the algorithm selects a 

column with index i for which Ufa) < L(vj,B,L T ). The running time of the method is dominated 
by the search for a column which satisfies U < L. To compute L, one needs (j)(h, B), and hence the 
eigenvalues of B, and (B — l'I^) -1 . This takes 0(k 3 ) time once per iteration, for a total of 0(rk 3 ). 
Then, for i = 1, . . . ,n, we need to compute L for every Vj. This takes 0(nk 2 ) per iteration, for a 
total of 0(nrk 2 ). To compute U, we need e^e^ for i = 1, . . . ,n which takes 0(mn). So, in total 
DeterministicSampling takes 0(nrk 2 + mn), hence Algorithm 1 needs 0(mnmm{m,n} + nfe 3 /e 2 ). 

DeterministicSampling selects vectors using a greedy procedure such that the sampled vectors 
satisfy the bounds of the Lemma 4 below. The bounds of Lemma 4 along with the structural bound 
of Lemma 3 immediately give the result in Theorem 1. 

Lemma 4 ([2]). DeterministicSampling constructs matrices rZ, S such that, 

||(V T ftS) + || 2 <\-Jt ||E«S|| F < ||E|| F . 
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1: Input: A G M mxn , target rank k < rank(A), and accuracy parameter < e < 1/2. 
2: Compute V fc G M. nxk from the SVD of A. 

3: Let r = \ 36k ln(20fc)/e 2 ] , [Q, S] = RandomSampling(V£, r), and C = XQS G M mxr . 
4: Let x r = C + b G ffi r ; and construct x r G W 1 from x r at the indices corresponding to 

the selected columns C and elsewhere. 
5: return x r G M n with at most r non-zero entries. 



Algorithm 3: A Randomized Sparse Solver for Least-Squares 



Our randomized algorithm (Theorem 2). Algorithm 3 is similar to Algorithm 1. The main 
qualitative differences are that it only needs and it uses a randomized algorithm to sample the 
columns. The basic idea in RandomSampling is to compute probabilities pi = ||vj W^/k for i = 1, . . . , n, 
and then sample r columns in r i.i.d. trials. In each trial, a column is sampled according to these 
probabilities. The running time of RandomSampling is 0(nk + r log(r)), so the total running time of 
Algorithm 3 is 0(mnm.in{m,n} + klogklog(k/e)/e 2 ). 

As with Lemma 4, we are going to need some properties of the sampling and rescaling matrices 
that RandomSampling delivers. This random sampling algorithm was introduced in [9]. Lemma 5 
is an application of this algorithm for sampling columns from matrices of orthonormal rows, while 
Lemma 6 is a simple corollary of Lemma 5. Lemma 8 is a simple fact that proved recently in [5]. 
Finally, Lemma 9 is a direct application from a matrix multiplication result in [5]. 

Lemma 5 ([9]). If r > 4kln(2k/5)/e 2 , then, w.p. at least I -5: || V T QSS T ft T V - I fc || 2 < e. 

Lemma 6 (Corollary of Lemma 5). If r = Akln(2k/5)/e 2 ), then w.p. at least 1 — 5, 
||(V T fiS)+ - (V T QS) T || 2 < -4-.. 



|Ens||| 



Lemma 7 ([5], eq. (36)). For any matrix E G M. mxn , E 

Corollary 8 (By Markov's inequality). For any E G R mxn , w.p. 1-5 ||EfiS||^ < j||E|||,. 
Lemma 9 ([5]). For 1 < r < n and any E with EV = mxk , w.p. 1-5, ||EfiSS T T V||| < 
Proof. We apply eqn. (4) of Lemma 4 in [5], with E, V, and EV = mX k. 

ieWii'iiv^ 



E [||EftSS T ft T V||!] < V 



(i)H 2 



After substituting^ = ||V(j) || 2 //c, and using Markov's inequality, the result follows. 



Input: V T = [vi, . . . , v n ] G R kxn with Vj G R k and r > 4k In k. 

In 1 1 2 

For i = l,...,d compute probabilities pi = r||vi|| 2 . 
Initialize 0, = nxr and S = rX r- 
for r = 1 to r do 

Select index % G {1, 2, n} i.i.d. with the probability of selecting index i being pi 
Set £li :T = 1 and S Tjr = 1/^/pir. 
end for 

Return: Sampling and rescaling matrices G 1R' 1XT ', S G M rxr 



Algorithm 4: RandomSampling [9] 
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3 Proofs 

3.1 Proof of Theorem 1 

By Lemma 4, rank(V^OS) = k (to see this, use ||(Vfcf2S) + || 2 = l/<7fc(VfcOS) and r > k), so we can 
apply Lemma 3: 

||Ax r - b|| 2 < \\Ax% - b|| 2 + || (A - A fc ) fiS(V^S) + S" 1 U^b|| 2 . 
We manipulate the second term of this equation as follows. First, notice that 

|| (A - A fc ) OS(Vfc r fiS)+S- 1 Ufc r b|| 2 = || (A - Afc) OS(Vfc r OS)+S- 1 Ufc r b|| F . 

Then, 

||(A-Afc)fiS(Vfc r OS) + S fc - 1 Ufc r b|| F < ||(A-Afc)OS(Vfc r fiS) + || F ||S fc - 1 Ufc r b|| 2 (a) 

< ||(A-Afc)OS|| F ||(Vfc r OS) + || 2 a fe - 1 (A)||b||2 (b) 

< ||A-Afc|| F (l-^A : )" 2 a,T 1 (A)||b|| 2 (c) 

< (1 + e)||A - A fe || F( r- 1 (A)||b|| 2 . (d) 



(a) follows by spectral submultiplicativity; (b) follows by submultiplicativity and using ||U fe || 2 = 1 
(c) follows by the bounds in Lemma 4; finally, (d) follows after some algebra, because r = \9k/ 



f 2 



3.2 Proof of Theorem 2 

The basic idea of the proof is similar, except that we now use the lemmas corresponding to Random- 
Sampling. Let 5 = 1/10 and assume for the moment that r = |~ 4A; ln(2/c /5)/e 2 ] (rescaling e below 
will give the value of r in the theorem); the sampling and rescaling matrices f2,S are returned by 
RandomSampling(Vfc , r) (see also Section 2). Then, C = AS1. By the union bound, with probability 
at least 1 — 35 = 0.7, the bounds in Lemmas 5, 9 and Corollary 8 all hold. Since Lemma 5 implies 
Lemma 6 and 9, all four of these bounds hold. The remainder of the proof assumes that we are in 
this 0.7 probability event when all four bounds hold. 

From Lemma 5, rank(VfcfiS) = k, so Lemma 3 gives (recall that E = A — Afc): 

||Ax r - b|| 2 < ||Axfc - b|| 2 + ||EOS(Vfc r fiS) + S^ 1 Ufc r b|| F . 

As with the proof of Theorem 1, we manipulate the second term as follows: 

||EOS(Vfc r f]S) + S^ 1 Ufc r b|| F < ||EfiS(Vfc r f]S) + || F ||S fc - 1 Ufc r b|| 2 

< ||EfiS(Vfc r fiS) + || F -^ 1 (A)-||b|| 2 . 

We now bound || EOS ( V^QS) + 1| F as follows. 

||EOS(Vfc r fiS)+|| F < ||EOSSfi T Vfc|| F + ||EfiS((Vfc r OS)+ - (Vfc r OS) T )|| F 

< \\E\\ Fy /k/r5+ ||E|| F e/ v / 5(l - e) 

< 3e||E|| F . 

The first inequality follows from the triangle inequality, and the second follows after applying 
Lemma 9 to the first term, spectral submultiplicativity, Lemma 6, and Corollary 8; the last in- 
equality follows from elementary algebra, because e < 1/2. Rescale e — > e/3, to wrap up. ■ 
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3.3 Proof of Lemma 3 



We will prove a considerably more general result, of which Lemma 3 is a simple corollary. Be- 
low, we first introduce a general matrix approximation problem and present an algorithm and an 
approximation result for this problem (Lemma 10). Lemma 3 is a simple corollary of Lemma 10. 

Let B £ ]j mxw kg a matrix which we would like to approximate; let A £ l mx " be the matrix 
which we will use to approximate B. Specifically we want a sparse approximation of B from A, 
which means that we would like to choose C £ W nxr consisting of r < n columns from A such 
that ||B — CC + B|| F is small. If A = B (approximating B using the columns of B), then, this is 
the column based matrix approximation problem, which has received much interest recently [3, 2]. 
The more general problem which we study here, with A ^ B, takes on a surprisingly more difficult 
flavor. Our motivation is regression, but the problem could be of more general interest. We will 
approach the problem through the use of matrix factorizations. For Z £ M nxfc , with Z T Z = Ifc, let 
A = HZ T + E, where H £ R mxk ; and, E £ M. mxn is the residual error of the factorization. For fixed 
A and Z, \\E\\ 6 (f = 2,F) is minimized when H = AZ. Let £ R nxr , S £ W xr be sampling and 
rescaling matrices, respectively, and let C = AJ7S £ M mxr . 

Lemma 10 (Generalized Column-based Matrix Approximation). 7/'rank(Z T ilS) = k, then, 
|| B - CC + B|| 5 < ||B - HH+B|| 5 + ||Ef}(Z T ft)+H + B|| s . 

Proof. 

||B - CC + B|| ? < ||B - C(Z T ftS) + H + B|| 5 (a) 
= ||B - Af}(Z T ftS) + H + B|| 5 
= ||B - (HZ T + E)ftS(Z T ftS) + H+B|| c 
= ||B - H(Z T OS)(Z T ftS)+H+B + Ef)(Z T OS) + H+B|| 5 
= ||B-HH+B + EftS(Z T S!S) + H + B|| 5 (b) 
< ||B - HH + B|| ? + ||EfiS(Z T f]S) + H+B|| 5 . (c) 

(a) follows by the optimality of C + B; (b) follows because rank(Z T f]S) = k and so Z T ilS(Z T OS) + = 
1^; finally, (c) follows by the triangle inequality of matrix norms. ■ 

This lemma is a general tool for the general matrix approximation problem. It is worth parsing 
this lemma carefully, to understand its implications. The left hand side is the matrix approximation 
of B using the dimensionally reduced C. The right hand side has two terms which highlight some 
tradeoffs: the first term is the approximation of B using H (H is used in the factorization to 
approximate A); the second term is related to E, the residual error in approximating A. Ideally, one 
should choose H and Z to simultaneously approximate B with H and have small residual error E. 
In general, these are two competing goals, and a balance should be struck. For the remainder of this 
work, we focus on the Frobenius norm, and will consider only one extreme of this tradeoff, namely 
choosing the factorization to minimize ||E|| F . Specifically, since Z has rank k, the best choice for 
HZ T which minimizes ||E||f is A&. In this case, E = A - A fe . Via the SVD, A fc = U fc £ fc v£, and 
so A = (UfcSfc)Vj + A — Afc. We can apply Lemma 10, with B = b, H = UfcEfc, Z = Vfc and 
E = A — Afc , giving as a corollary the next lemma. 

Lemma 11. If rank(Vfc S7S) = k, then, 

||b - CC + b|| 2 < ||b - UfcU^bHa + || (A - Afc)fiS(Vfc r OS) + S- 1 Ufc r b|| 2 . 

Note that 

||b- CC + b|| 2 = ||Ax r - b|| 2 , 
for x r £ W 1 constructed from this C and ||b — UfcUfc b|| 2 = ||b — Ax£||, which give Lemma 3. 
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4 Related work 



Our results can be viewed as extensions of similar results obtained before using the so-called Rank- 
Revealing QR (RRQR) factorization [4]. For fixed A,b, and k, the authors of [4] use a QR-like 
decomposition to select exactly k columns of A and compare their sparse solution vector xj. (r = k 
in this case) with x£; notice that we compare the corresponding values || Ax^ — t> 1 1 2 and || Ax£ — 1> 1 1 2 - 
More specifically, Eqn. (12) of [4] along with the Strong RRQR results of [7] imply that 



A-CC+A 2 / „ A+ , „ b-Axt 
x fc 2 < ... " - • 2 A+b 2 + - 



v /4 fc(re _ fc) + l||A-A fc || 2 ^ 2 || A + b || 2 + lib - Ax*|| 2 



<r k (A) V k ^fc(A) 



< v /4fc(n-fc) + l(T fc+ i(A) / ||b|| 2 | ||b -Ax^|| 2 



o-*(A) V o"*(A) o-fc(A) 



< ^ ( "-^ + 1 .(2||b|| 2 + ||b-Ax;|| 2 ) 

0"fc(A) 

This bound is interesting; however, it only applies to fixed sparsity r = k. Extending the Rank- 
Revealing QR approach for obtaining arbitrary r-sparse solutions is not obvious. Our algorithms 
though allow the user to set the sparsity parameter as large as she likes and trade the accuracy with 
the number of non-zero elements in the solution vector. 
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